Handling Data Skew in MapReduce
Abstract
MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is often skewed. This leads to highly varying execution times for the reducers. Varying execution times result in low resource utilisation and high overall execution time, since the next MapReduce cycle can only start after all reducers are done. In this paper, we address the problem of efficiently processing MapReduce jobs with complex reducer tasks over skewed data. We define a new cost model that takes into account non-linear reducer tasks, and we provide an algorithm to estimate the cost in a distributed environment. We propose two load balancing approaches, fine partitioning and dynamic fragmentation, that are based on our cost model and can deal with both skewed data and complex reduce tasks. Fine partitioning produces a fixed number of data partitions, whereas dynamic fragmentation dynamically splits large partitions into smaller portions and replicates data if necessary. Our approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We empirically evaluate our solution on both synthetic data and real data from an e-science application.
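The idea of cost-based load balancing over a fixed set of partitions can be illustrated with a small sketch. The abstract does not give the cost function or the assignment rule, so everything here is an assumption made for illustration: the reducer cost is taken to be quadratic in the size of each key cluster, there are more partitions than reducers, and partitions are assigned greedily (heaviest first, to the least-loaded reducer). The names `partition_cost`, `assign_partitions`, and `reducer_loads` are hypothetical and are not taken from the paper or from Hadoop.

```python
import heapq


def partition_cost(cluster_sizes, exponent=2):
    """Estimated cost of one partition under an ASSUMED non-linear reducer:
    each key cluster of size n costs n**exponent."""
    return sum(size ** exponent for size in cluster_sizes)


def assign_partitions(partition_costs, num_reducers):
    """Greedy assignment (an assumed heuristic): take partitions from the most
    expensive to the cheapest and give each to the least-loaded reducer."""
    # Min-heap of (current load, reducer id).
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    by_cost = sorted(partition_costs, key=partition_costs.get, reverse=True)
    for pid in by_cost:
        load, reducer = heapq.heappop(heap)
        assignment[pid] = reducer
        heapq.heappush(heap, (load + partition_costs[pid], reducer))
    return assignment


def reducer_loads(assignment, partition_costs, num_reducers):
    """Total estimated cost per reducer for a given assignment."""
    loads = [0] * num_reducers
    for pid, reducer in assignment.items():
        loads[reducer] += partition_costs[pid]
    return loads


# Hypothetical skewed input: partition id -> sizes of the key clusters in it.
clusters = {0: [10], 1: [3, 3], 2: [5], 3: [4, 2], 4: [6], 5: [1, 1, 1]}
costs = {pid: partition_cost(sizes) for pid, sizes in clusters.items()}
assignment = assign_partitions(costs, num_reducers=2)
loads = reducer_loads(assignment, costs, num_reducers=2)
print(loads)  # [100, 102]
```

In this toy input, partition 0 holds a single large cluster and dominates the cost, so it gets a reducer to itself while the five smaller partitions share the other one. Balancing by tuple count alone would not produce this split, which is why a cost model that accounts for non-linear reducers matters.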
Similar resources
Handling Data Skew in MapReduce Cluster by Using Partition Tuning
The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data pro...
Handling Skew in Multiway Joins in Parallel Processing
Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In...
Fine-Grained Micro-Tasks for MapReduce Skew-Handling
Recent work on MapReduce has considered the problems of skew, where a job’s tasks exhibit large variance in size and processing cost, and stragglers, tasks that run slowly due to conditions on particular nodes. In this paper, we discuss an extremely simple approach to mitigating skew and stragglers: break the workload into many small tasks that are dynamically scheduled at runtime. This approac...
A Survey on Partitioning Skew Diminishing Techniques in Hadoop MapReduce Environment
In the era of Big Data, large volumes of structured and unstructured data are created. MapReduce is an effective tool for parallel data processing. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data assigned to each task. This causes some tasks to take much longer to finish than others and can significantly impact performance. Parallel data p...
Handling Data Skew in Map Reduce Using Hadoop Libra
Many efficient tools make significant use of Map Reduce applications that assign data to their corresponding tasks in parallel and distributed data processing. LIBRA symbolizes the lightweight problems of data skew with input data applications that can overlap map and reduce strategies. This is one of the innovative and accurate distribution methods for intermediate data sampling with n...
Handling partitioning skew in MapReduce using LEEN
MapReduce is emerging as a prominent tool for big data processing. Locality is a key feature in MapReduce that is extensively leveraged in data-intensive cloud systems: it avoids network saturation when processing large amounts of data by co-allocating computation and data storage in the map phase. However, our studies with Hadoop, a widely used MapReduce implementation, demonstrate that the presen...